Spot instances and recovering from shutdown

Requesting spot (or preemptible) instances reduces compute cost while still giving access to powerful instances. The trade-off is that the cloud provider can reclaim the instance at any moment, which disrupts the workload.

Requesting a spot instance with dataplane

On AWS EKS or Azure AKS

A spot instance has to be requested from the manifest:

...
spec:
  ...
  types:
    Worker:
      ...
      resources:
        ...
        extraSelectors:
          karpenter.sh/capacity-type: spot

By default, AIchor experiments request karpenter.sh/capacity-type: on-demand.

On GCP GKE

...
spec:
  ...
  types:
    Worker:
      ...
      resources:
        ...
        extraSelectors:
          cloud.google.com/gke-spot: "true"
          cloud.google.com/gke-provisioning: spot
        # on GKE, a toleration also has to be passed
        extraTolerations:
          - key: "cloud.google.com/gke-spot"
            operator: "Equal"
            value: "true"
            effect: "NoSchedule"

Recovering from eviction

Recovering from eviction is supported by two AIchor operators: jobset and kuberay. Demo projects using these two operators are available here and here. Experiments using any other operator will fail when an eviction happens.

spec:
  operator: jobset # or kuberay

  restartPolicy:
    backoffLimit: 5

In the snippet above, spec.restartPolicy.backoffLimit is the number of allowed restarts: this experiment can handle 5 failures (including evictions) before being marked as failed.

Side note for the other operators

On the other operators (jax, pytorch, ...), spec.restartPolicy.backoffLimit only covers software failures, that is, when spec.command exits with a non-zero status. In that case the command is re-executed in the same container, on the same node.

Checkpointing

During training, the most reliable way to avoid losing progress is to periodically save checkpoints to an external storage backend (such as AIchor S3 buckets). When the experiment restarts, the software should automatically recover from the latest checkpoint and resume training. This makes training reliable even on spot instances.
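The save-and-resume pattern could look roughly like the following sketch. It is not AIchor's API: the function names (save_checkpoint, load_latest_checkpoint, train), the step_*.json file layout, and the toy "weight" state are all hypothetical. It also assumes the checkpoint directory is a filesystem path that stands in for the external storage backend; with an object store such as S3, the atomic rename would be replaced by an upload call from your storage client.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(ckpt_dir: Path, step: int, state: dict) -> Path:
    """Write a checkpoint atomically so an eviction mid-write never leaves a corrupt file."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final_path = ckpt_dir / f"step_{step:08d}.json"
    # Write to a temporary file in the same directory, then rename it into place.
    fd, tmp_name = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_name, final_path)
    return final_path


def load_latest_checkpoint(ckpt_dir: Path):
    """Return (step, state) from the newest checkpoint, or (0, None) if none exists."""
    # Zero-padded step numbers make lexicographic order match numeric order.
    candidates = sorted(ckpt_dir.glob("step_*.json")) if ckpt_dir.exists() else []
    if not candidates:
        return 0, None
    with open(candidates[-1]) as f:
        payload = json.load(f)
    return payload["step"], payload["state"]


def train(ckpt_dir: Path, total_steps: int, save_every: int) -> dict:
    """Toy training loop that resumes from the latest checkpoint if one exists."""
    step, state = load_latest_checkpoint(ckpt_dir)
    if state is None:
        state = {"weight": 0.0}  # fresh start: hypothetical initial model state
    while step < total_steps:
        state["weight"] += 1.0  # placeholder for a real training step
        step += 1
        if step % save_every == 0 or step == total_steps:
            save_checkpoint(ckpt_dir, step, state)
    return state
```

Because train always starts by calling load_latest_checkpoint, a restart after an eviction needs no special flag: the same command resumes from the last saved step, and at most save_every steps of work are repeated.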